Engineering posts about Incident Management
Curated summaries and key learnings for engineers working with Incident Management.
How security teams can report cyber risk to boards
The article argues that translating cyber risk into financial terms lets boards make informed decisions about security investments. It emphasizes the need for coherent risk...
Triage, ship, debug—all from Slack
Duolingo has built an AI-powered Slack app that integrates with tools such as GitHub, Jenkins, and AWS to improve developer productivity and streamline incident management. The app...
Why AI Security Infrastructure is Now a CMO Priority
The article emphasizes the critical role of AI security infrastructure in modern enterprises, particularly highlighting the launch of Databricks Lakewatch, an innovative security information and...
When DNSSEC goes wrong: how we responded to the .de TLD outage
The article discusses the DNSSEC outage affecting the .de TLD on May 5, 2026, when DENIC published incorrect DNSSEC signatures, leading to widespread SERVFAIL responses from validating resolvers. It...
Code Orange: Fail Small is complete. The result is a stronger Cloudflare network
The article outlines the completion of Cloudflare's 'Code Orange: Fail Small' initiative, aimed at enhancing the resilience and reliability of its network infrastructure. Key improvements include the...
Alert Fatigue Is a Business Risk
The article examines alert fatigue in enterprise security operations, where the sheer volume of alerts creates significant risk as analysts struggle to prioritize and...
From Incident Counting to SLIs: How DigitalOcean Rethought Availability
The article discusses DigitalOcean's transition from an incident-counting methodology to a more nuanced SLI-based approach for measuring availability. Initially, the company relied on a simplistic...
Trust But Canary: Configuration Safety at Scale
This Meta Tech Podcast episode, featuring Pascal Hartig, covers the strategies Meta's Configurations team uses to keep configuration rollouts safe at scale. The...
A one-line Kubernetes fix that saved 600 hours a year
The article discusses a critical performance issue encountered when running Atlantis, a tool for managing Terraform changes, on Kubernetes. The problem stemmed from slow restarts due to a default behavior...
Databricks Announces Lakewatch: New Open, Agentic SIEM
Databricks has introduced Lakewatch, an innovative open security information and event management (SIEM) solution designed to address the limitations of traditional SIEMs, particularly in the context...
From vendors to vanguard: Airbnb’s hard-won lessons in observability ownership
The article outlines Airbnb's transition from a vendor-managed observability platform to a custom in-house solution built on open-source technology, specifically Prometheus. It details the challenges...
Scaling Autonomous Site Reliability Engineering: Architecture, Orchestration, and Validation for a 90,000+ Server Fleet
The article discusses the implementation of an AI-powered Site Reliability Engineer (SRE) agent at Cloudways, which manages a fleet of over 90,000 servers. It outlines the architecture involving an...
Building a security overview dashboard for actionable insights
The article presents a comprehensive overview of a newly developed security dashboard designed to enhance the efficiency of security teams by providing actionable insights rather than mere...
Investigating multi-vector attacks in Log Explorer
The article discusses the complexity of modern multi-vector attacks and the need for comprehensive visibility through tools like Cloudflare Log Explorer. It outlines...
Cloudflare outage on February 20, 2026
On February 20, 2026, Cloudflare experienced a significant outage affecting customers using its Bring Your Own IP (BYOIP) service due to a misconfiguration in the Border Gateway Protocol (BGP)...
2025 Q4 DDoS threat report: A record-setting 31.4 Tbps attack caps a year of massive DDoS assaults
The 2025 Q4 DDoS threat report by Cloudflare reveals a significant escalation in DDoS attacks, with a record-setting attack of 31.4 Tbps marking a year of unprecedented assaults. The report...
Route leak incident on January 22, 2026
On January 22, 2026, a misconfiguration in Cloudflare's routing policy led to a significant BGP route leak, affecting both Cloudflare customers and external networks. The incident, which lasted 25...
When protections outlive their purpose: A lesson on managing defense systems at scale
The article outlines the challenges faced by GitHub in managing defense mechanisms that protect the platform from abuse while ensuring legitimate users are not adversely affected. It highlights the...
Securing the Grid: A Practical Guide to Cyber Analytics for Energy & Utilities
The article outlines the critical cybersecurity challenges faced by the Energy & Utilities sector, particularly due to the convergence of IT and operational technology (OT) systems. It emphasizes the...
Code Orange: Fail Small — Our resilience plan following recent incidents
The article outlines Cloudflare's 'Code Orange: Fail Small' initiative aimed at enhancing the resilience of its network following significant outages. It details the incidents that led to the plan,...